class: center, middle, inverse, title-slide .title[ # Clustering ] .author[ ### Eva Freyhult, Mun-Gwan Hong ] .institute[ ### NBIS, SciLifeLab ] .date[ ### 2022-09-15 ] --- class: spaced <style type="text/css"> quote { display: block; font-family: sans-serif; text-indent: 10px; font-size: 13px; color: green; line-height: 1.2; } bold { font-style: bold; color: blue; } smaller { font-family: sans-serif; font-size: 15px; } </style> # Clustering <img src="data:image/png;base64,#/Users/mun-gwan.hong/Documents/Teaching/Biostatistic/workshop-mlbiostatistics/session-clustering/html/presentation_files/figure-html/clusters-1.png" width="70%" style="display: block; margin: auto;" /> Clustering is about grouping objects together according to similarity. The objects are grouped into clusters so that objects within the same cluster are more similar to one another than to objects in other clusters. --- # Examples .pull-left[  <font size="2">January 2021</font>  ] .pull-right[ * Diabetes. Type 1 and 2. Any subtypes? * Clustering * Ahlqvist _et al._: <quote> "Model <bold> variables were selected </bold> on the premise that patients develop diabetes when they can no longer increase their insulin secretion..." </quote> <quote> "...TwoStep clustering, in which the first step estimates the optimal number of clusters on the basis of <bold> silhouette width </bold> and the second step does <bold> hierarchical clustering </bold>, was done in SPSS version 23 for two to 15 clusters using log-likelihood as a <bold>distance</bold> measure and Schwarz’s Bayesian criterion for clustering. <bold>k-means clustering</bold> was done with a k value of 4..." </quote> * Wagner _et al._: <quote> "We used <bold>partitioning on variables </bold> derived from oral glucose tolerance tests, MRI-measured body fat distribution, liver fat content and genetic risk" </quote> <quote> "...<bold>distances</bold> were computed as Gower distances using standardized variables..." "To find the optimal cluster count, we evaluated the dendogram and <bold>silhouette-widths</bold>. The clustering procedure was performed with the partitioning around medoids (pam) method in the R-package ‘cluster’, which is a more robust version of <bold>k-means clustering</bold>." </quote> ] --- # Clustering .pull-left[ Clustering for a set of `\(n\)` objects (or observations) is usually performed based on a selection of `\(p\)` variables (or measurements). A single observation `\(i\)` can thus be described by the vector `\({\mathbf x}_i = [x_{i1}, x_{i2}, \dots, x_{ip}]\)`. Clustering is commonly used for **exploratory data analysis** and to identify **sub-structures** in a data set. There are many types of clustering algorithms, here we will only discuss two of them, *K-means* and *hierarchical* clustering. ] .pull-right[ <!-- --> ] --- layout: true # K-means --- The K-means clustering aims to divide all objects into exactly `\(K\)` clusters. `\(K\)` has to be given to the algorithm. The algorithm minimize the variance within clusters, by iteratively assigning each object to the cluster with the closest centroid. The centroid of cluster `\(k\)` is the arithmetic mean of all `\(n_k\)` objects in the cluster. `$${\mathbf m}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} {\mathbf x}_{i}$$` --- The Lloyd and Forgy's algorithm can be performed as follows; .pull-left[ 1. Initialization. Select `\(k\)` initial centroids. </br> <smaller>The initial centroids can be selected in several different ways. Two common methods are;</smaller> * <smaller>Select <span style="font-style: italic;">k</span> data points as initial centroids</smaller> * <smaller>Randomly assign each data point to one out of <span style="font-style: italic;">k</span> clusters and compute the centroids for these initial clusters.</smaller> ] .pull-right[ <!-- --> ] --- The algorithm can be performed as follows; .pull-left[ 1. Initialization. Select `\(k\)` initial centroids. 2. Assign each object to the closest centroid (in terms of squared Euclidean distance). The squared Euclidean distance between an object (a data point) and a cluster centroid `\(m_k\)` is `\(d_i = \sum_{j=1}^p (x_{ij} - m_{kj})^2\)`. By assigning each object to closest centroid the total within cluster sum of squares (WSS) is minimized. `$$WSS = \sum_{k=1}^K\sum_{i \in C_k}\sum_{j=1}^p (x_{ij} - m_{kj})^2$$` ] .pull-right[ <!-- --> ] --- The algorithm can be performed as follows; .pull-left[ 1. Initialization. Select `\(k\)` initial centroids. 2. Assign each object to the closest centroid (in terms of squared Euclidean distance). By assigning each object to closest centroid the total within cluster sum of squares (WSS) is minimized. 3. Update the centroids for each of the `\(k\)` clusters by computing the centroid for all the objects in each of the clusters. ] .pull-right[ <!-- --> ] --- The algorithm can be performed as follows; .pull-left[ 1. Initialization. Select `\(k\)` initial centroids. 2. Assign each object to the closest centroid (in terms of squared Euclidean distance). By assigning each object to closest centroid the total within cluster sum of squares (WSS) is minimized. 3. Update the centroids for each of the `\(k\)` clusters by computing the centroid for all the objects in each of the clusters. 4. Repeat 2-3 until convergence ] .pull-right[ <!-- --> ] --- Repeat until convergence .pull-right[ <!-- --> ] --- Repeat until convergence .pull-right[ <!-- --> ] --- Repeat until convergence .pull-right[ <!-- --> ] --- Repeat until convergence .pull-right[ <!-- --> ] --- Repeat until convergence .pull-right[ <!-- --> ] --- Repeat until convergence .pull-right[ <!-- --> ] --- Convergence is reached. .pull-right[ <!-- --> ] --- layout:false layout:true # K-means --- layout: true # Choosing the number of clusters --- K-means clustering requires that we specify the number of clusters, `\(k\)`. How do we select such a `\(k\)`? --- ### Elbow method <!-- The Elbow method is based on the total within sum of squares, `\(WSS\)`, that the K-means algorithm tries to minimize. By running the algorithm with several different values of `\(k\)`, e.g. 1--10, we can plot WSS as a function of `\(k\)`. --> <!-- --> <!-- WSS is constantly decreasing as `\(k\)` increases. The elbow methods suggests that The inflection (bend, elbow) on the curve indicate an optimal number of clusters. In this case, the elbow method suggests that the optimal number of clusters is `\(k=3\)`. --> --- ### Silhouette method The silhouette value for a single object `\(i\)` is a value between -1 ans 1 that measure how similar the object is to other objects in the same cluster as compared to how similar it is to objects in other clusters. The average silhouette over all objects is a measure of how good the clustering is, the higher the value the better is the clustering. <img src="data:image/png;base64,#/Users/mun-gwan.hong/Documents/Teaching/Biostatistic/workshop-mlbiostatistics/session-clustering/html/presentation_files/figure-html/silhouette-1.png" width="70%" /> --- ### Silhouette method The silhouette value, `\(s(i)\)`, for an object `\(i\)` in cluster `\(C_a\)` is calculated as follows; .pull-right[ <!-- --> ] --- ### Silhouette method The silhouette value, `\(s(i)\)`, for an object `\(i\)` in cluster `\(C_a\)` is calculated as follows; .pull-left[ 1. Average distance between `\(i\)` and all other objects in <span style="color: blue;">the same cluster `\(C_a\)`</span>; `$$a(i) = \frac{1}{|C_a|-1} \sum_{j \neq i, j \in C_a} d(i, j)$$` 2. Average distance between `\(i\)` and all objects in <span style="color: red;">another cluster `\(C_b\)`, `\(d(i,C_b)\)`</span> and define the minimum; `$$b(i) = \min_{b \neq a} d(i, C_b)$$` 3. The Silhouette index is defined as; `$$s(i) = \frac{b(i) - a(i)}{max(a(i), b(i))}$$` ] .pull-right[ <!-- --> ] A silhouette value close to 1 means that the object is very well clustered, a value close to 0 means that the object lies between two clusters. A negative silhouette value means that the object is incorrectly clustered. --- layout: false name: diss layout: true # Dissimilarity matrix --- <!-- All clustering algorithms need a measure of similarity or dissimilarity between objects. As a similarity can be transformed to a dissimilarity, we will here focus on dissimilaities. --> <!-- Dissimilarities between all pairs of objects can be described in a dissimilarity matrix. Most algorithms are based on symmetric dissimilarities, i.e. when the dissimilarity between a and b is the same as between b and a. Also, most algorithm require non-negative dissimilarities. --> K-means uses the squared Euclidean distance as a dissimilarity measure, but there of course other ways to measure the dissimilarity between two objects (data points). <!-- An objects can usually be described by a set of `\(p\)` measurements, for `\(n\)` objects we have the measurements `\(x_{ij},\)` where `\(i=1,\dots,n\)` and `\(j=1,\dots,p\)`. --> --- Common dissimilarity measures include; *Euclidean distance* `$$d_{euc} (x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$$` -- *Squared Euclidean distance* `$$d_{eucsq} (x, y) = \sum_{j=1}^{p} (x_j - y_j)^2$$` --- *Manhattan distance* `$$d_{man} (x, y) = \sqrt{\sum_{j=1}^{p} |x_j - y_j|}$$` --- *Pearson correlation distance* Pearson's correlation is a similarity measure `$$r = \frac{\sum_{j=1}^p(x_j-\bar x)(y_i-\bar y)}{\sqrt{\sum_{j=1}^p(x_j-\bar x)^2\sum_{j=1}^p(y_j-\bar y)^2}}$$` Using a transformation we can compute a Pearson's correlation distance `$$d_{pear}(x,y) = \sqrt{1-r}$$` --- layout: false layout: true # Hierarchical clustering --- Hierarchical clustering does not require the number of clusters to be specified. Instead of creating a single set of clusters it creates a hierarchy of clusterings based on pairwise dissimilarities. <img src="data:image/png;base64,#/Users/mun-gwan.hong/Documents/Teaching/Biostatistic/workshop-mlbiostatistics/session-clustering/html/presentation_files/figure-html/hclust0-1.png" width="80%" /> --- There are two strategies for hierarchical clustering *agglomerative* (bottom-up) and *divisive* (top-down). --- ## Agglomerative <!-- --> --- ## Divisive <!-- --> --- layout: false name: agglo layout: true # Agglomerative clustering --- * Starts with all objects in separate clusters * Merge based on smallest dissmiliarity * Dissimilarity between clusters is defined using a **linkage method** The dissimilarity between two clusters A and B with objects `\(a_1, \dots, a_{n_A}\)` and `\(b_1, \dots, b_{n_B}\)`, respectively, can be computed using one of several linkage methods. --- ### Single linkage .pull-left[ Single linkage takes as a cluster dissimilarity the distance between the two closest objects in the two clusters.] .pull-right[ `$$d_{sl}(A, B) = \min_{i,j} d(a_i, b_j)$$` ] <!-- --> --- ### Complete linkage .pull-left[Complete linkage takes as a cluster dissimilarity the distance between the two objects furthest apart in the two clusters.] .pull-right[ `$$d_{cl}(A, B) = \max_{i,j} d(a_i, b_j)$$` ] <!-- --> --- ### Average linkage .pull-left[Average linkage takes as a cluster dissimilarity the average distance between the objects in the two clusters.] .pull-right[ `$$d_{al}(A, B) = \frac{1}{n_A n_B}\sum_i\sum_j d(a_i, b_j)$$` ] <!-- --> --- ### Ward's linkage Ward's linkage method minimize the within variance, by merging clusters with the minimum increase in within sum of squares. `$$d_{wl}(A, B) = \sum_{i=1}^{n_A} (a_i - m_{A \cup B})^2 + \sum_{i=1}^{n_B} (b_i - m_{A \cup B})^2 - \sum_{i=1}^{n_A} (a_i - m_{A})^2 - \sum_{i=1}^{n_B} (b_i - m_{B})^2$$` where `\(m_A, m_B, m_{A \cup B}\)` are the center of the clusters `\(A\)`, `\(B\)` and `\(A \cup B\)`, respectively. Note that Ward's linkage method should not be combined with any dissimilarity matrix as it is based on the squared Euclidean distance. In the R function `hclust` either the Euclidean or squared Euclidean distance can be used in combination with the linkage `method='ward.D'` or `method='ward.D2`, respectively. --- #### Ward's linkage <!-- --> --- layout: false # Heatmap <img src="data:image/png;base64,#/Users/mun-gwan.hong/Documents/Teaching/Biostatistic/workshop-mlbiostatistics/session-clustering/html/presentation_files/figure-html/heatmap-1.png" width="65%" style="display: block; margin: auto;" /> This heatmap was created using the `pheatmap` package in R.